The IBM® Speech to Text service provides an Application Programming Interface (API) that lets you add speech transcription capabilities to your applications. To transcribe the human voice accurately, the service leverages machine intelligence to combine information about grammar and language structure with knowledge of the composition of the audio signal. The service continuously returns and retroactively updates the transcription as more speech is heard.
The service provides a variety of interfaces to suit the needs of your application. It supports many features that make it suitable for numerous use cases, and it provides a customization interface that lets you enhance its base language and acoustic capabilities with vocabularies and acoustic characteristics specific to your domain, environment, and speakers.
Supported interfaces
The Speech to Text service offers four interfaces: a WebSocket interface, a sessionless HTTP REST interface, a session-based HTTP REST interface, and an asynchronous HTTP interface.
SDKs are also available that simplify using the service's interfaces in various programming languages. For more information about application development with the service, see Overview for developers.
An optional customization ID for a custom acoustic model that is adapted for the acoustic characteristics of your environment and speakers. By default, no custom model is used. See Custom models.
An optional customization ID for a custom language model that includes terminology from your domain. By default, no custom model is used. See Custom models.
An optional double between 0.0 and 1.0 that indicates the relative weight that the service gives to words from a custom language model compared to those from the base vocabulary. The default is 0.3 unless a different weight was specified when the custom language model was trained. See Custom models.
An optional integer that specifies the number of seconds for the service's inactivity timeout; use -1 to indicate infinity. The default is 30 seconds. See Inactivity timeout.
An optional boolean that directs the service to return intermediate hypotheses that are likely to change before the final transcript. By default (false), interim results are not returned. See Interim results.
An optional array of keyword strings that the service spots in the input audio. By default, keyword spotting is not performed. See Keyword spotting.
An optional double between 0.0 and 1.0 that indicates the minimum threshold for a positive keyword match. By default, keyword spotting is not performed. See Keyword spotting.
An optional integer that specifies the maximum number of alternative hypotheses that the service returns. By default, the service returns a single final hypothesis. See Maximum alternatives.
An optional model that specifies the language in which the audio is spoken and the rate at which it was sampled, broadband or narrowband. By default, the en-US_BroadbandModel model is used. See Language and models.
An optional boolean that indicates whether the service censors profanity from a transcript. By default (true), profanity is filtered from the transcript. See Profanity filtering.
An optional boolean that indicates whether the service converts dates, times, numbers, currency, and similar values into more conventional representations in the final transcript. By default (false), smart formatting is not performed. See Smart formatting.
An optional boolean that indicates whether the service identifies which individuals spoke which words in a multi-participant exchange. By default (false), speaker labels are not returned. See Speaker labels.
An optional boolean that indicates whether the service produces timestamps for the words of the transcript. By default (false), timestamps are not returned. See Word timestamps.
An optional value of chunked that causes the audio to be streamed to the service. By default, audio is sent all at once as a one-shot delivery. See Audio transmission.
An optional authentication token that makes authenticated requests to the service without embedding your service credentials in every call. By default, service credentials must be passed with each request. See Authentication tokens and request logging.
An optional double between 0.0 and 1.0 that specifies the threshold at which the service reports acoustically similar alternatives for words of the input audio. By default, word alternatives are not returned. See Word alternatives.
An optional boolean that indicates whether the service provides confidence measures for the words of the transcript. By default (false), word confidence measures are not returned. See Word confidence.
An optional boolean that indicates whether you opt out of the request logging that IBM performs to improve the service for future users. By default (false), request logging is performed. See Authentication tokens and request logging.
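To make these options concrete, here is a minimal sketch of a sessionless recognition request that combines several of them as query parameters. The parameter names (model, max_alternatives, timestamps, smart_formatting, profanity_filter, speaker_labels) follow the service's HTTP API; the credentials and the sample.mp3 file are placeholders rather than values from this notebook.
In [ ]:
import requests

# Placeholder credentials and audio file; substitute your own values
url = "https://stream.watsonplatform.net/speech-to-text/api/v1/recognize"
username = "$USERNAME"
password = "$PASSWORD"

# Combine several of the optional parameters described above
params = {
    'model': 'en-US_NarrowbandModel',  # language and sampling rate
    'max_alternatives': 3,             # return up to three final hypotheses
    'timestamps': 'true',              # word timestamps
    'smart_formatting': 'true',        # dates, times, numbers, currency
    'profanity_filter': 'false',       # disable the default censoring
    'speaker_labels': 'true'           # identify who spoke which words
}

# Send the audio as the request body; the header declares its format
with open('sample.mp3', 'rb') as audio:
    response = requests.post(
        url,
        params=params,
        auth=(username, password),
        headers={'Content-Type': 'audio/mp3'},
        data=audio)

print(response.json())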
In [ ]:
!pip install --upgrade watson_developer_cloud
import requests
import json
import os
from os.path import join, dirname
from watson_developer_cloud import SpeechToTextV1
In [ ]:
# @hidden_cell
url = "https://stream.watsonplatform.net/speech-to-text/api/v1/recognize"
username = "$USERNAME"
password = "$PASSWORD"
file1 = "https://github.com/krondor/nlp-dsx-pot/raw/master/aging.mp3"
file2 = "http://podcast.c-span.org/podcast/SBHAR1020.mp3"
file3 = "https://github.com/krondor/nlp-dsx-pot/raw/master/reagan-thatcher.mp3"
In [ ]:
!wget {file1} -O aging.mp3 -nc
# Define Local File for CURL
filepath = './aging.mp3'
!curl -X POST -u {username}:{password} \
--header "Content-Type: audio/mp3" \
--data-binary @{filepath} \
"https://stream.watsonplatform.net/speech-to-text/api/v1/recognize"
By default, the service returns a basic transcription of the input audio. In this example, we use the requests library to structure our POST request and pass parameters to the service: we enable speaker diarization (speaker labels) and select the speech model. The results are then loaded into a pandas data frame and analyzed.
In [ ]:
!wget {file3} -O reagan-thatcher.mp3 -nc

# Define Local File for the requests call
filepath = './reagan-thatcher.mp3'
filename = os.path.basename(filepath)
audio = open(filename, 'rb')

# Define Speech to Text Feature Parameters
params = (
    ('model', 'en-US_NarrowbandModel'),
    ('speaker_labels', 'true')
)

# Send the audio as the request body; the Content-Type header declares its format
response = requests.post(
    url,
    params=params,
    auth=(username, password),
    headers={"Content-Type": "audio/mp3"},
    data=audio)

response_data = response.json()
print('status_code: {} (reason: {})'.format(response.status_code, response.reason))
In [ ]:
import pandas as pd
data = []
for item in response_data['results']:
    for trans in item['alternatives']:
        data.append({'transcript': trans['transcript'], 'confidence': trans['confidence']})
# Create Pandas Data Frame of Transcript Results with Confidence
df = pd.DataFrame(data)
# View Snippet
df.head(5)
In [ ]:
%matplotlib inline
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
matplotlib.style.use('ggplot')
# Histogram of transcript confidence scores
plt.figure()
df['confidence'].plot.hist()
In [ ]:
speech_to_text = SpeechToTextV1(
    username=username,
    password=password,
    x_watson_learning_opt_out=False
)
!wget {file1} -O aging.mp3 -nc
filepath = './aging.mp3' # path to file
filename = os.path.basename(filepath)
# List the available models, then inspect the US English broadband model
print(json.dumps(speech_to_text.models(), indent=2))
print(json.dumps(speech_to_text.get_model('en-US_BroadbandModel'), indent=2))
with open(filename, 'rb') as audio_file:
    print(json.dumps(speech_to_text.recognize(
        audio_file, content_type='audio/mp3', timestamps=True,
        word_confidence=True, speaker_labels=True),
        indent=2))
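Because speaker_labels was enabled, the JSON from the earlier requests call should also carry a speaker_labels array that maps word time ranges to speaker IDs. The short sketch below tabulates it with pandas; it assumes the field layout described under Speaker labels (from, to, speaker, and confidence per word) and reuses the response_data variable from the requests example.
In [ ]:
# Tabulate the speaker labels from the earlier requests response
# (assumes response_data includes a 'speaker_labels' array because the
# request was made with speaker_labels=true)
labels = response_data.get('speaker_labels', [])
speaker_df = pd.DataFrame(labels)

# One row per word: start time, end time, speaker ID, and confidence
speaker_df.head(10)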
In [ ]: